Event-Driven Optimal Feedback Control for
Multi-Antenna Beamforming

Kaibin Huang, Vincent K. N. Lau, and Dongku Kim 

Abstract 

Transmit beamforming is a simple multi-antenna technique for increasing throughput and the transmission
range of a wireless communication system. The required feedback of channel state information
(CSI) can potentially result in excessive overhead especially for high mobility or many antennas. This 
work concerns efficient feedback for transmit beamforming and establishes a new approach of controlling 
feedback for maximizing net throughput, defined as throughput minus average feedback cost. The 
feedback controller using a stationary policy turns CSI feedback on/off according to the system state 
that comprises the channel state and transmit beamformer. Assuming channel isotropy and Markovity, 
the controller's state reduces to two scalars. This allows the optimal control policy to be efficiently 
computed using dynamic programming. Consider the perfect feedback channel free of error, where each 
feedback instant pays a fixed price. The corresponding optimal feedback control policy is proved to be 
of the threshold type. This result holds regardless of whether the controller's state space is discretized or 
continuous. Under the threshold-type policy, feedback is performed whenever a state variable indicating 
the accuracy of transmit CSI is below a threshold, which varies with channel power. The practical finite-rate
feedback channel is also considered. The optimal policy for quantized feedback is proved to be
also of the threshold type. The effect of CSI quantization is shown to be equivalent to an increment on 
the feedback price. Moreover, the increment is upper bounded by the expected logarithm of one minus 
the quantization error. Finally, simulation shows that feedback control increases net throughput of the 
conventional periodic feedback by up to 0.5 bit/s/Hz without requiring additional bandwidth or antennas. 

Index Terms 

Array signal processing, stochastic optimal control, feedback communication, time-varying channels, 
dynamic programming, Markov processes 

I. Introduction 

Transmit beamforming is a popular multi-antenna technique for enhancing the reliability and throughput 
of a wireless communication link [1]. In many systems, transmit beamforming requires feedback of 
channel state information (CSI), incurring significant overhead especially for a large number of transmit 
antennas or fast fading [2], [3]. In this paper, we consider the transmit beamforming system and propose
a new approach of maximizing net throughput, defined as throughput minus average feedback cost, via 
optimal feedback control. The controller under consideration turns CSI feedback on/off by observing 
the current system state that consists of the current channel state and transmit beamformer. The optimal
stationary control policy is shown to be of the threshold type. As a result, feedback is performed whenever 
transmit CSI is sufficiently outdated as measured by an optimal threshold function. Optimal feedback 
control is observed to substantially increase net throughput of the transmit beamforming system compared 
with the conventional periodic feedback [2], [4], [5]. 

A. Prior Works 

In multi-antenna systems, adaptive transmission techniques such as beamforming and precoding typically
require periodic feedback of complex vectors or matrices derived from CSI. The potentially large
feedback overhead has motivated active research on intelligent algorithms for quantizing feedback CSI, 
forming a research area called limited feedback [3]. Different approaches for quantizing CSI have been 
proposed, including line packing [4], [5], combined channel parameterization and scalar quantization 
[6], subspace interpolation [7], and Lloyd's algorithm [8], [9]. Furthermore, various types of limited 
feedback systems have been designed, namely beamforming [4], [5], precoded orthogonal space-time 
block codes [10], precoded spatial multiplexing [11], and multiuser downlink [12]. The practicality of 
limited feedback has been recognized by the industry and related techniques have been integrated into 
latest wireless communication standards such as IEEE 802.16 [13] and 3GPP LTE [14]. 

Besides quantization, CSI feedback can be compressed by exploiting channel temporal correlation [6], 
[15], [16]. In [6], each CSI matrix for a multiple-input-multiple-output (MIMO) channel is parameterized 
and the parameters are sent back incrementally using the delta modulation. In [15], the feedback CSI 
matrix is compressed to be one bit indicating the channel variation with respect to a reference matrix sent 
by the transmitter. A lossy feedback compression algorithm is proposed in [16], which reduces feedback 
overhead by omitting in feedback the infrequent transitions between CSI states. In view of prior works,
it remains unknown how the average feedback cost can be minimized for a given throughput.

The applications of opportunism [17], [18] to CSI feedback have resulted in opportunistic feedback 
algorithms for reducing sum feedback overhead in multi-user multi-antenna systems [19]-[23]. The
common feature of these algorithms is that CSI feedback is performed only if a channel quality indicator 
exceeds a fixed threshold. Compared with periodic feedback over dedicated channels (see e.g. [4], [5]), 
opportunistic (aperiodic) feedback is much more efficient in terms of sum feedback overhead and thus 
is suitable for systems where users randomly access a common feedback channel. The thresholds for 
opportunistic feedback can be computed iteratively for maximizing throughput as in [19], [20] or derived
in closed-form expressions for achieving optimal capacity scaling for asymptotically large numbers of 
users [21]-[24]. For simplicity, the temporal correlation in practical channels is omitted in existing designs 
where independent block fading is assumed. Thus the existing opportunistic feedback algorithms are 
incapable of adapting feedback thresholds to channel dynamics for further feedback reduction. 

The common objective of the works mentioned above is to maximize throughput. This performance 
metric fails to account for feedback cost though feedback competes with data transmission for resources 
including time, bandwidth and power. Thus net throughput defined earlier is a more practical metric. 
In [25]-[27], net throughput is maximized by optimizing the resource allocation to data transmission 
and feedback. In [25], a two-way beamforming system is considered, where data and CSI flow in both 
directions of the link between two multi-antenna transceivers. For this system, bounds on the feedback 
rate for maximizing net throughput are derived. For a similar system, net throughput is maximized in [26] 
by optimizing power allocation to training, feedback and data transmission. Net throughput optimization 
for the beamforming system is also investigated in [27] in terms of optimal bandwidth allocation to 
feedback and data transmission. Aligned with the direction of prior works, the current paper addresses 
net throughput maximization for transmit beamforming from the new perspective of feedback control, 
which adapts the mentioned resource allocation to channel dynamics. 

B. Contributions and Organization 

In this paper, we consider a single-user transmit beamforming system with multiple transmit antennas and a
single receive antenna. Each feedback instant incurs a fixed cost in bit/s/Hz, called the feedback price. A
feedback controller turns the feedback link either on or off such that net throughput is maximized. This 
work is based on the following assumptions. First, channel realizations form a stationary Markov chain. 
Second, the channel coefficients are i.i.d. complex Gaussian random variables. This assumption allows the 
state of the feedback controller to reduce to two scalars g and z without compromising the controller's 
optimality. The parameter g is the channel power and z the squared cosine of the angle between the 
transmit beamformer and the channel vector. Large z indicates accurate transmit CSI and vice versa 
[4], [5]. Finally, the distribution of z in the next slot conditioned on a realization a in the current slot is 
assumed to stochastically dominate the counterpart conditioned on b < a [28]. Essentially, this assumption 
implies that z being large in a slot likely remains large in the next slot. 

The contributions of this paper are summarized as follows. In general, the paper establishes a new 
approach for controlling feedback for transmit beamforming to maximize net throughput. To efficiently 
compute the optimal control policy using dynamic programming (DP) [29], the state space of the 
controller, namely the product space of (g, z), is quantized. 1 Consider the perfect feedback channel
free of feedback error. First, given the quantized state space, the feedback control policy for maximizing 
net throughput is proved to be of the threshold type. Specifically, feedback is performed only if z is 
below the optimal threshold that depends on g. Second, the threshold type policy is proved to be optimal 
for feedback control with the continuous (unquantized) state space. Next, we consider the finite-rate feedback
channel that requires feedback CSI quantization. Third, the optimality of the threshold-type feedback
control policy is proved for quantized feedback. Feedback CSI quantization reduces the receive SNR and
also varies the dynamics of z. Fourth, to gain insight into these two effects, they are treated separately and
each of them is shown to decrease net throughput. Finally, we show that the effect of CSI quantization
on net throughput can be interpreted as an increment on the feedback price. This increment is upper 
bounded by the expected logarithm of one minus the quantization error. 

Simulation results are also presented for the channel model specified by i.i.d. Rayleigh fading and 
Clarke's temporal correlation. Define the feedback gain as throughput for free feedback minus that for 
no feedback. With respect to periodic feedback, optimal controlled feedback is observed to increase the 
feedback gain by up to 0.5 bit/s/Hz, equal to 24% of the feedback gain. The increase in net throughput 
is insensitive to the variation on Doppler frequency and the number of transmit antennas. For both 
perfect and imperfect feedback channels, the optimal feedback control policies computed numerically are 
observed to exhibit the threshold structure as predicted analytically. Moreover, the feedback threshold 
decreases with the increasing feedback price, corresponding to less frequent feedback. Last, feedback 
quantization is observed to reduce ergodic throughput as well as the feedback threshold, decreasing the 
feedback frequency. 

The remainder of this paper is organized as follows. The system model is described in Section II and
the feedback control problems are formulated in Section III. The optimal feedback control policies for the
perfect and finite-rate feedback channels are analyzed in Sections IV and V, respectively. Simulation results
are presented in Section VI, followed by concluding remarks in Section VII.

Notation: A matrix is represented by a boldface capitalized letter and a vector by a boldface small
letter. The (m, n)th element of a matrix X is represented by $[X]_{m,n}$. For a vector x, $[x]_m$ gives the mth
element. The superscript $\dagger$ denotes the complex conjugate transpose operation on a matrix or a vector.
Define the operator $(a)^+$ on a scalar a as $(a)^+ := \max(0, a)$. The realization of a stochastic process in
the tth time slot is specified by the subscript t.

1 This quantization differs from that for finite-rate feedback considered in Section V and thus has no effect on the quality of
feedback CSI. 
Fig. 1. Transmit beamforming system with controlled CSI (channel shape) feedback

II. System Model 

We consider the transmit beamforming system illustrated in Fig. 1, where a transmitter with L antennas
transmits to a receiver with a single antenna. The frequency-flat channel is an L × 1 complex vector denoted
as h. To facilitate our designs, h is decomposed into the channel power $g := \|h\|^2$ and the channel shape
$s := h/\|h\|$, which are the indicators of the channel quality and direction, respectively. It follows that
$h = \sqrt{g}\, s$. It is well known that applying s as the beamforming vector, denoted as f, maximizes the
receive signal-to-noise ratio (SNR) [1]. To this end, s is estimated by the receiver at the beginning of
each time slot of $T_c$ seconds and communicated to the transmitter via the feedback channel. 2 The channel
is assumed constant within each slot and thus feedback is performed at most once per slot. 3 Depending
on the channel state h (or g and s), the feedback controller turns CSI feedback on/off at the beginning of
each time slot. Let $U := \{0, 1\}$ denote the control state space and $\mu \in U$ the feedback decision, where 1
and 0 correspond to the on and off states of the feedback link. Define the controller's state x := (g, s, f)
that contains all system variables affecting net throughput obtained in the sequel. Thus the state space
is $X := \mathbb{R}_+ \times \mathbb{O}^L \times \mathbb{O}^L$ where $\mathbb{O}^L$ represents the unit hypersphere embedded in $\mathbb{C}^L$ [5]. We consider
a stationary feedback control policy $V : X \to U$ independent of the slot index t [29]. It is assumed
that each use of the feedback channel incurs a feedback cost of B bits. 4 Moreover, the transmission



2 Besides CSI bits, feedback contains an extra bit identifying the feedback instant if the feedback channel is assigned to a
single user or multi-bit user identity if multiple users share the feedback channel.
3 This requires that $T_c$ is shorter than the channel coherence time.

4 The parameter B measures the equivalent number of data bits that can be transmitted reliably using the resources allocated 
to one-time feedback. Feedback CSI is delay sensitive and thus cannot be protected by strong error correcting codes as their 
decoding delay is too long. Therefore, feedback CSI is typically transmitted using larger power and lower-order modulation than 
those for data transmission. As a result, the communication cost of one CSI bit is higher than that of one data bit (B > 1). 
time of feedback CSI is assumed negligible. 5 We consider long data codewords covering many channel
realizations. Given channel ergodicity and stationary feedback control, the net throughput in bit/s can be
written as [25]

$$R = \frac{1}{T_s} E\left[\log_2\left(1 + P g |s^\dagger f|^2\right)\right] - \frac{B}{T_c} \Pr(\mu = 1) \qquad (1)$$

where P is the transmit SNR and $T_s$ the symbol duration. For simplicity, net throughput can be written
in bit/s/Hz as

$$J = E\left[\log_2\left(1 + P g |s^\dagger f|^2\right)\right] - \alpha \Pr(\mu = 1) \qquad (2)$$

where $\alpha := B T_s / T_c$ is called the feedback price. 6

The feedback controller controls the transmit beamformer via feedback and thereby influences the
receive SNR. Define $z := |s^\dagger f|^2$, which represents the controllable component of the receive SNR,
$\mathrm{SNR}_r = Pgz$ [4], [5]. Consider the perfect feedback channel. The temporal variation of z under feedback control
is illustrated in Fig. 2. Upon CSI feedback, f is updated with s and the value of z is reset to the maximum
of one; if feedback is turned off, z is smaller than one because f fails to adapt instantaneously to the
time-varying s. With f fixed, the probability density function (PDF) of z is referred to as the uncontrolled
PDF and denoted as f(z | f). The uncontrolled PDF governs the dynamics of z between two consecutive
feedback instants (cf. Fig. 2). The random variable z is used later as a controller state variable. In addition
to the perfect feedback channel, one with a finite-rate constraint is also considered
in the sequel.
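
The reset-and-decay behaviour of z described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system simulated in this paper: the fading process is taken to be first-order autoregressive with a hypothetical correlation `rho`, and the feedback rule uses a hypothetical constant threshold on z rather than the optimal threshold function derived later. The names `crandn` and `simulate` are introduced here for illustration.

```python
import math
import random


def crandn(rng):
    """Draw one CN(0, 1) sample (real and imaginary parts have variance 1/2)."""
    sd = math.sqrt(0.5)
    return complex(rng.gauss(0.0, sd), rng.gauss(0.0, sd))


def simulate(L=4, P=10.0, alpha=0.5, rho=0.95, threshold=0.6, slots=2000, seed=1):
    """Track z = |s^H f|^2 under a fixed-threshold feedback rule.

    Assumptions (illustrative only): AR(1) fading with correlation rho and a
    threshold that does not depend on the channel power g.
    Returns (z per slot, decision mu per slot, empirical net throughput).
    """
    rng = random.Random(seed)
    h = [crandn(rng) for _ in range(L)]
    f = None  # no beamformer before the first feedback
    zs, mus, total = [], [], 0.0
    innovation = math.sqrt(1.0 - rho ** 2)
    for _ in range(slots):
        g = sum(abs(x) ** 2 for x in h)           # channel power
        s = [x / math.sqrt(g) for x in h]         # channel shape
        if f is None:
            z = 0.0                               # forces feedback in slot 0
        else:
            z = min(1.0, abs(sum(si.conjugate() * fi for si, fi in zip(s, f))) ** 2)
        mu = 1 if z < threshold else 0            # feedback decision
        if mu == 1:
            f, z = s, 1.0                         # feedback resets z to one
        total += math.log2(1.0 + P * g * z) - alpha * mu
        zs.append(z)
        mus.append(mu)
        # AR(1) update keeps the CN(0, 1) marginal distribution
        h = [rho * x + innovation * crandn(rng) for x in h]
    return zs, mus, total / slots
```

Calling `simulate()` returns a trajectory in which z equals one in every feedback slot and drifts downward between feedback instants, which is the qualitative behaviour sketched in Fig. 2.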

We make several assumptions on the channel distribution as described shortly. These assumptions 
facilitate computing the optimal feedback control policy V* using DP [29] and analyzing the policy 
structure. Model the channel as a stochastic sequence denoted as $h_0, h_1, h_2, \ldots$, where $h_t$ is the channel
state in the tth slot.

Assumption 1: The sequence $h_0, h_1, h_2, \ldots$ is a stationary Markov chain.
In other words, given $h_n$, $h_{n+1}$ is independent of the past realizations $h_{n-1}, h_{n-2}, \ldots$. Markov chains
are commonly used for modeling temporally-correlated wireless channels (see e.g., [30]-[34]). Markov
channel models have been validated both analytically (see e.g., [30], [35]) and by measurement [36]. Next,
the controller's state space X has (2L+1) dimensions. Due to the curse of dimensionality, computing V*

5 In practice, feedback CSI is treated as control signals and transmitted in the header that occupies a small fraction of each 
slot. This justifies the omission of CSI transmission time. 

6 The value of $\alpha$ is large if the power allocated to feedback or the number of channel coefficients is large, or if the symbol rate is
low, and vice versa.
Fig. 2. Temporal variation of z under feedback control

is impractical if L is large [29]. The following assumption, which is commonly made in the literature
(see e.g. [37], [38]), overcomes this difficulty.

Assumption 2: The channel h comprises i.i.d. $\mathcal{CN}(0, 1)$ random variables.
Given this assumption, g follows a chi-square distribution with L complex degrees of freedom [39] and s is
isotropic. As a result, the uncontrolled distribution of z is independent of f and hence we can write f(z | f)
as f(z). 7 It follows that the state variables s and f can be combined into z. Thus the controller's state
and state space reduce to x = (g, z) and $X = \mathbb{R}_+ \times Z$ with Z := [0, 1], respectively, thereby overcoming
the mentioned curse of dimensionality. We make the following assumption on the temporal correlation
of z that affects the structure of V* (cf. Section IV).

Assumption 3: For 1 > a > b > 0, the uncontrolled distribution of $z_{t+1}$ conditioned on $z_t = a$ is
stochastically dominant [28] over that conditioned on $z_t = b$. Mathematically,

$$\Pr(z_{t+1} > c \mid z_t = a) \ge \Pr(z_{t+1} > c \mid z_t = b)$$

where $0 \le c \le 1$.

This assumption essentially states that a large $z_t$ likely leads to a large $z_{t+1}$ and vice versa, which is reasonable
given channel temporal correlation. This work requires no assumption on the temporal correlation of g.

Finally, let $f(z_{t+1} \mid z_t, \mu_t)$ and $f(g_{t+1} \mid g_t)$ denote the transition PDFs of z and g, respectively. For
convenience, the state transition PDF is written as $f_x := f(g_{t+1} \mid g_t)\, f(z_{t+1} \mid z_t, \mu_t)$.
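
The closed-form tail probability of z under Assumption 2, $\Pr(z > r) = (1 - r)^{L-1}$ (see footnote 7), lends itself to a quick numerical sanity check. The following Monte Carlo sketch assumes, by isotropy, that the beamformer can be fixed to the first coordinate vector without loss of generality; the function name `empirical_tail` and the sample sizes are illustrative choices, not part of the paper.

```python
import math
import random


def empirical_tail(L, r, trials=20000, seed=0):
    """Estimate Pr(z > r) for z = |s^H f|^2 with i.i.d. CN(0, 1) channel entries.

    The beamformer f is fixed to the first coordinate vector, so that
    z = |h_1|^2 / ||h||^2 (an assumption justified by channel isotropy).
    """
    rng = random.Random(seed)
    sd = math.sqrt(0.5)
    count = 0
    for _ in range(trials):
        h = [complex(rng.gauss(0.0, sd), rng.gauss(0.0, sd)) for _ in range(L)]
        g = sum(abs(x) ** 2 for x in h)
        z = abs(h[0]) ** 2 / g
        if z > r:
            count += 1
    return count / trials


if __name__ == "__main__":
    L, r = 4, 0.3
    print(empirical_tail(L, r), (1.0 - r) ** (L - 1))
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `trials` grows.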



7 Given Assumption 2, the uncontrolled distribution of z conditioned on an arbitrary f is $\Pr(z > r) = (1 - r)^{L-1}$ [5], [40].

III. Problem Formulation

In this section, the problems of optimal feedback control are formulated; they are solved in the subsequent
sections. Both perfect and finite-rate feedback channels are considered in the problem formulation.



The generic average and discounted reward problems are defined as follows [29]. By abuse of notation,
the symbols in the preceding section are reused here. Consider a dynamic system with an infinite number
of stages (infinite horizon), a state space X and a control space U. The system dynamics are specified
by the state transition kernel $f_x(x_{t+1} \mid x_t, \mu_t)$ with $x \in X$ and $\mu \in U$. The reward-per-stage is represented
by the function $G : X \times U \to \mathbb{R}_+$. The control policy $V : X \to U$ is optimized for maximizing either
the average reward

$$J := \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} E\left[G(x_t, \mu_t)\right] \qquad (3)$$

or the discounted reward

$$J_\beta(x_0) := \sum_{t=0}^{\infty} \beta^t E\left[G(x_t, \mu_t) \mid x_0\right] \qquad (4)$$

where $0 < \beta < 1$ is the discount factor. The corresponding optimization problems are called the
infinite-horizon average and discounted reward problems, represented by $A(X, U, f_x, G)$ and $D(X, U, f_x, G)$,
respectively. These problems can be solved iteratively using DP [29].

Consider the perfect feedback channel. Net throughput in (2) can be written as the average reward in
(3) with G given by

$$G(g, z, \mu) := \begin{cases} \log_2(1 + Pg) - \alpha, & \mu = 1, \\ \log_2(1 + Pgz), & \text{otherwise.} \end{cases} \qquad (5)$$

Hereafter the terms average reward and net throughput are used interchangeably. Thus the feedback
controller can be designed by solving $A(X, U, f_x, G)$ with $f_x$ obtained as follows

$$\begin{aligned} f_x(x_{t+1} \mid x_t, \mu_t) &= f_x((g_{t+1}, z_{t+1}) \mid (g_t, z_t), \mu_t) \\ &= f(g_{t+1} \mid z_{t+1}, g_t, z_t, \mu_t)\, f(z_{t+1} \mid g_t, z_t, \mu_t) \\ &\overset{(a)}{=} f(g_{t+1} \mid g_t)\, f(z_{t+1} \mid z_t, \mu_t) \end{aligned} \qquad (6)$$

where (a) follows from the channel isotropy. From the discussion in Section II,

$$f(z_{t+1} \mid z_t = c, \mu_t) = \begin{cases} f(z_{t+1} \mid z_t = 1), & \mu_t = 1, \\ f(z_{t+1} \mid z_t = c), & \text{otherwise} \end{cases} \qquad (7)$$

with $c \in Z$. The optimal policy V* for solving $A(X, U, f_x, G)$ is analyzed in Section IV. 8

8 The problem of net throughput maximization can also be formulated as the multi-objective optimization problem of
maximizing ergodic throughput $E[\log_2(1 + Pgz)]$ and minimizing the feedback rate $\alpha \Pr(\mu = 1)$. Note that the average
feedback rate is proportional to the feedback probability. The multi-objective reward function can be modified from (2) by
replacing $\alpha$ with $\lambda\alpha$, with $\lambda > 0$ being the weight factor. Varying $\lambda$ varies the relative importance of throughput maximization
and feedback rate reduction. Solving the multi-objective optimization problem with varying $\lambda$ gives the maximum throughput
as a function of the average feedback rate. However, proving the Pareto optimality [41] of this function seems difficult.
In practice, CSI feedback is implemented using a narrow-band (finite-rate) control channel [2], [13],
[14]. The finite-rate feedback constraint requires quantizing feedback CSI as described shortly. Let $\hat{s}$
denote the L × 1 complex unitary vector resulting from quantizing s [4], [5]. We consider a codebook-based
quantizer where the codebook $\mathcal{F}$ is a set of complex unitary vectors [4], [5]. Using $\mathcal{F}$, $\hat{s}$ is obtained
by quantizing s using the maximum SNR criterion, namely $\hat{s} = \arg\max_{x \in \mathcal{F}} |s^\dagger x|^2$. Define $\epsilon := |s^\dagger \hat{s}|^2$
where $0 \le \epsilon \le 1$. This scalar quantifies the loss on the receive SNR ($\mathrm{SNR}_\epsilon = Pg\epsilon$) upon CSI feedback
compared with that (SNR = Pg) for the perfect feedback channel [4], [5], [40]. Note that $\epsilon$ is equal
to one minus the quantization error defined in [40], [42]. The distribution of $\epsilon$ depends on the channel
distribution and the design of the quantizer codebook [4], [5], [40]. 9 Again, net throughput maximization is
formulated as an average reward problem. This problem differs from $A(X, U, f_x, G)$ for perfect feedback
only in the state transition kernel and reward-per-stage. The transition PDF of z is obtained as

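$$f^\epsilon(z_{t+1} \mid z_t = c, \mu_t) = \begin{cases} E_\epsilon\left[f(z_{t+1} \mid z_t = \epsilon)\right], & \mu_t = 1, \\ f(z_{t+1} \mid z_t = c), & \text{otherwise} \end{cases} \qquad (8)$$

where $E_\epsilon$ denotes the expectation over the distribution of $\epsilon$. Thus for the corresponding average reward
problem, the state transition kernel is $f_x^\epsilon$, obtained from (6) by replacing $f(z_{t+1} \mid z_t, \mu_t)$ with
$f^\epsilon(z_{t+1} \mid z_t, \mu_t)$. Given the receive SNR $\mathrm{SNR}_\epsilon = Pg\epsilon$ upon CSI
feedback, the reward-per-stage function $G^\epsilon$ is modified from (5) as

$$G^\epsilon(x, \mu) := \begin{cases} E_\epsilon\left[\log_2(1 + Pg\epsilon)\right] - \alpha, & \mu = 1, \\ \log_2(1 + Pgz), & \text{otherwise.} \end{cases} \qquad (9)$$

In Section V, we consider $A(X, U, f_x^\epsilon, G^\epsilon)$ and analyze the resultant optimal policy.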
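
To make the maximum SNR criterion of the codebook-based quantizer concrete, the sketch below selects the codeword that maximizes $|s^\dagger x|^2$ and returns the resulting $\epsilon$. It assumes a randomly generated codebook of unit-norm vectors purely for illustration; the helper names `random_unit_vector`, `inner_sq` and `quantize_shape` are hypothetical and no particular codebook design from the cited works is implied.

```python
import math
import random


def random_unit_vector(L, rng):
    """Draw an isotropic unit-norm complex vector of length L."""
    v = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(L)]
    norm = math.sqrt(sum(abs(x) ** 2 for x in v))
    return [x / norm for x in v]


def inner_sq(a, b):
    """Return |a^H b|^2 for two complex vectors of equal length."""
    return abs(sum(x.conjugate() * y for x, y in zip(a, b))) ** 2


def quantize_shape(s, codebook):
    """Maximum SNR criterion: pick the codeword x maximizing |s^H x|^2.

    Returns the selected codeword and epsilon = |s^H s_hat|^2.
    """
    s_hat = max(codebook, key=lambda x: inner_sq(s, x))
    return s_hat, inner_sq(s, s_hat)


if __name__ == "__main__":
    rng = random.Random(0)
    codebook = [random_unit_vector(4, rng) for _ in range(16)]  # 4-bit codebook
    s = random_unit_vector(4, rng)
    _, eps = quantize_shape(s, codebook)
    print("epsilon =", eps)
```

With such a quantizer, the receive SNR in a feedback slot is $Pg\epsilon$ rather than $Pg$, which is the loss captured by (9).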



IV. Feedback Control Policy: Perfect Feedback Channel 

In this section, the optimal feedback control policy is analyzed for the perfect feedback channel. To 
compute the policy using DP, the state space of the feedback controller is quantized. Given the discrete 
state space, the optimal control policy is proved to be of the threshold type. This policy structure is 
shown to also hold for the optimal feedback control with the continuous state space. 

A. State-Space Quantization 

The channel state $(g, z) \in X$ is quantized as $(\hat{g}, \hat{z}) \in \hat{X}$, which is used as the input of the feedback
controller, where $\hat{X}$ denotes the discrete state space defined in the sequel. Feedback control with $\hat{X}$
allows applying stochastic optimization theory to analyzing the optimal control policy in the sequel. 10
The algorithms for quantizing (g, z) are described as follows.

9 For example, for the isotropic channel in Assumption 2 and a randomly generated codebook, the distribution function of $\epsilon$
is $\Pr(\epsilon > \delta) = (1 - \delta)^{L-1}$ [40], [42].

The space of g, namely the nonnegative real line $\mathbb{R}_+$, is partitioned into M line segments $[g_0, g_1)$,
$[g_1, g_2)$, ..., $[g_{M-1}, \infty)$, where $g_0 = 0$ and $0 < g_1 < g_2 < \cdots < g_{M-1} < \infty$. The values of $\{g_m\}$
are chosen such that g lies in different line segments with equal probabilities. In other words,
$\Pr(g \in [g_m, g_{m+1})) = 1/M$ for all $0 \le m \le M - 1$ with $g_M = \infty$. The above M line segments are represented by
a set of M finite values $Q = \{\hat{g}_0, \hat{g}_1, \ldots, \hat{g}_{M-1}\}$ called grid points [43], which are arbitrarily selected
from the corresponding segments and hence satisfy the constraints $\hat{g}_m \in [g_m, g_{m+1}]$ for all m. 11 Similarly,
the space of z, namely the line segment Z = [0, 1], is divided into N sub-segments of equal length
$[z_0, z_1), [z_1, z_2), \ldots, [z_{N-1}, z_N]$ where $z_0 = 0$ and $z_N = 1$. 12 Define the set $\bar{Z} := \{z_n\}_{n=0}^{N}$. Again, N
grid points $\hat{Z} = \{\hat{z}_0, \hat{z}_1, \ldots, \hat{z}_{N-1}\}$ are arbitrarily chosen from the N sub-segments mentioned earlier.
The discrete state space can be readily written as $\hat{X} := Q \times \hat{Z}$. The space X can be mapped to $\hat{X}$ using
the following quantization functions $Q_g$ and $Q_z$

$$\hat{g} = Q_g(g) = \hat{g}_m, \quad g \in [g_m, g_{m+1}) \qquad (10)$$
$$\hat{z} = Q_z(z) = \hat{z}_n, \quad z \in [z_n, z_{n+1}). \qquad (11)$$

Note that the above quantization algorithms are used for simplicity and are only one of many designs that
lead to the same results as obtained in the following sections. 13
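
The partitioning described above can be sketched numerically. The code below assumes the channel model of Assumption 2, under which g is the sum of L unit-mean exponential variables, so its distribution function is $1 - e^{-x}\sum_{k=0}^{L-1} x^k/k!$ and the equal-probability boundaries follow by bisection. Choosing the grid points as within-segment quantiles (for g) and midpoints (for z) is one arbitrary option among many and is an assumption of this sketch, as are all function names.

```python
import math
from bisect import bisect_right


def erlang_cdf(x, L):
    """CDF of g = ||h||^2 when h has L i.i.d. CN(0, 1) entries."""
    term, total = 1.0, 1.0
    for k in range(1, L):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total


def erlang_quantile(p, L):
    """Invert erlang_cdf by bisection for 0 < p < 1."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, L) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, L) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def g_boundaries(L, M):
    """Boundaries g_0 = 0 < g_1 < ... < g_{M-1}; each segment has probability 1/M."""
    return [0.0] + [erlang_quantile(m / M, L) for m in range(1, M)]


def g_grid_points(L, M):
    """One grid point per segment (illustrative choice: the segment's median)."""
    return [erlang_quantile((m + 0.5) / M, L) for m in range(M)]


def z_grid_points(N):
    """Midpoints of the N equal-length sub-segments of [0, 1] (illustrative choice)."""
    return [(n + 0.5) / N for n in range(N)]


def quantize_g(g, bounds):
    """Index m such that g lies in [g_m, g_{m+1}), cf. (10)."""
    return bisect_right(bounds, g) - 1


def quantize_z(z, N):
    """Index n such that z lies in [z_n, z_{n+1}), with z = 1 mapped to N - 1, cf. (11)."""
    return min(int(z * N), N - 1)
```

For example, `quantize_g(g, g_boundaries(4, 8))` maps a channel power to one of eight equiprobable segments for a four-antenna transmitter.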

Given Assumption 1, the sequences $\{\hat{g}_t\}$ and $\{\hat{z}_t\}$ are two Markov chains with the discrete state spaces
Q and $\hat{Z}$, respectively. The transition probabilities of the two Markov chains are decoupled as a result
of (6). For $\{\hat{z}_t\}$, let $P_{m,n}$ denote the probability for transition from the state m to n and $\mu$ the feedback
decision corresponding to the quantized controller input. Then $P_{m,n}$ can be written as a function of $\mu$

$$P_{m,n}(\mu) = \begin{cases} \int_{z_n}^{z_{n+1}} f(z_{t+1} = \tau \mid z_t = 1)\, d\tau, & \mu = 1 \\ \int_{z_n}^{z_{n+1}} f(z_{t+1} = \tau \mid z_t = \hat{z}_m)\, d\tau, & \text{otherwise} \end{cases} \qquad (12)$$

10 No comprehensive theory exists for the average cost/reward problem with an infinite or continuous state space [29].

11 The grid points in the spaces of g and z can be adjusted to yield a better approximation of the optimal policy for the
continuous state space. However, such an adjustment has no effect on the analysis in the sequel.

12 The line sub-segments are chosen to have equal length rather than equal probability since the distribution of z depends on
the optimal feedback control policy and is unknown at this stage.

13 Specifically, other quantization algorithms also lead to Theorem 1 and Proposition 1 if $d_B$ defined in (45) and $\Pr(g > g_{M-1})$
converge to zero with $M, N \to \infty$. See the proofs of Theorem 1 and Proposition 1 for details.
where $0 \le m, n \le N - 1$. Note that $P_{m,n}(1)$ is independent of m since feedback resets z to one. Similarly,
define the counterpart of $P_{m,n}$ for $\{\hat{g}_t\}$ as

$$\tilde{P}_{m,n} := \int_{g_n}^{g_{n+1}} f(g_{t+1} = \tau \mid g_t = \hat{g}_m)\, d\tau \qquad (13)$$

where $0 \le m, n \le M - 1$. Note that $\{\tilde{P}_{m,n}\}$ are unaffected by feedback control. For convenience, define
the transition probability matrix $\mathbf{P}$ with $[\mathbf{P}]_{m,n} := P_{m,n}$ and similarly $\tilde{\mathbf{P}}$ with $[\tilde{\mathbf{P}}]_{m,n} := \tilde{P}_{m,n}$. Due
to feedback control, the stationary probabilities of $\hat{z}$ depend on those of $\hat{g}$. Thus we define the joint
stationary probability $\pi_{m,n} := \Pr(\hat{g} = \hat{g}_m, \hat{z} = \hat{z}_n)$ where $0 \le m \le M - 1$, $0 \le n \le N - 1$. With the
discrete state space, the state transition kernel is denoted as $P_x := \tilde{\mathbf{P}} \times \mathbf{P}$.

The average reward problems $A(\hat{X}, U, P_x, G)$ and $A(X, U, f_x, G)$ are considered in Sections IV-B and
IV-C, respectively. Let $\hat{J}^*$ denote the maximum average reward for the discrete state space. Then $\hat{J}^*$
is an approximation of $J^*$ for the continuous state space. They converge as the quantization resolution
increases: $M \to \infty$, $N \to \infty$ (cf. Section IV-C).

B. Policy for Discrete State Space 

This section focuses on $A(\hat{X}, U, P_x, G)$. The resultant optimal policy V* is shown to be of the threshold
type. In addition, the computation of V* is discussed.

Rather than obtaining V* directly, the policy $V_\beta^*$ is derived by solving $D(\hat{X}, U, P_x, G)$. Then the
desired V* follows from $V_\beta^*$ by allowing $\beta \to 1$. The policy $V_\beta^*$ can be found using DP [29]. To this
end, define the DP operator F on a given function $q : \hat{X} \to \mathbb{R}_+$ as

$$(Fq)(\hat{g}_m, \hat{z}_n) = \max_{\mu \in \{0,1\}} \Big\{ G(\hat{g}_m, \hat{z}_n, \mu) + \beta \sum_{k,\ell} q(k, \ell)\, \tilde{P}_{m,k}\, P_{n,\ell}(\mu) \Big\} \qquad (14)$$

where $0 \le m \le M - 1$ and $0 \le n \le N - 1$. The maximum discounted reward satisfies Bellman's
equation $J_\beta^* = F J_\beta^*$ [29]. For convenience, represent $J_\beta^*((\hat{g}_m, \hat{z}_n))$ by $J_\beta^*(m, n)$.
We refer to an N × N stochastic matrix A as being monotone if A satisfies 14

$$\sum_{m=m_0}^{N-1} [A]_{m,n_1} \ge \sum_{m=m_0}^{N-1} [A]_{m,n_2} \quad \text{if } 0 \le n_2 \le n_1 \le N - 1 \qquad (15)$$

where $0 \le m_0 \le N - 1$. Thus $\mathbf{P}$ is monotone following Assumption 3 and (12). Moreover, define a
monotone vector of real numbers as one whose elements are in ascending order. The following lemma
is useful for the analysis in this paper.
Lemma 1:
1) Consider a real vector v and a stochastic matrix A that are both monotone and have the same
height. Then $A^\dagger v$ is a monotone vector;

2) Consider a matrix B of nonnegative elements and a matrix C with monotone rows. Then the rows
of BC are also monotone.

Proof: See Appendix A. □

14 In this paper, a stochastic matrix refers to a matrix that comprises nonnegative elements and whose columns
each sum to one [44].

The following lemma is essential for obtaining the main result of this section. The proof of Lemma 2
is based on value iteration [29]. Using this method, for an arbitrary function $q : \hat{X} \to \mathbb{R}_+$, the maximum
discounted reward can be computed iteratively as

$$J_\beta^*(m, n) = \lim_{k \to \infty} (F^k q)(m, n). \qquad (16)$$
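
Value iteration over the discrete state space can be illustrated with a small numerical sketch. The transition matrices below are toy choices made only for this example (a lazy random walk for the g-chain and a "stay or drop one level" chain for z, which satisfies a stochastic-dominance ordering); they are not derived from any fading model in this paper. Rows of `Pg` and `Pz` hold the transition probabilities out of a state, and feedback is modelled by starting the z-transition from the top grid point, mirroring the reset of z to one.

```python
import math


def reward(g, z, mu, P, alpha):
    """Reward-per-stage: feedback yields log2(1 + P g) minus the price alpha."""
    if mu == 1:
        return math.log2(1.0 + P * g) - alpha
    return math.log2(1.0 + P * g * z)


def toy_model(M=3, N=8):
    """Hypothetical grids and transition matrices (illustration only)."""
    g_grid = [0.5 + m for m in range(M)]
    z_grid = [n / (N - 1) for n in range(N)]      # top grid point equals one
    Pg = [[0.0] * M for _ in range(M)]
    for m in range(M):
        for k in (m - 1, m, m + 1):               # lazy random walk on g
            Pg[m][min(max(k, 0), M - 1)] += 1.0 / 3.0
    Pz = [[0.0] * N for _ in range(N)]
    for n in range(N):                            # z stays or drops one level
        Pz[n][n] += 0.6
        Pz[n][max(n - 1, 0)] += 0.4
    return g_grid, z_grid, Pg, Pz


def value_iteration(g_grid, z_grid, Pg, Pz, P, alpha, beta=0.95, iters=400):
    """Repeatedly apply the DP operator; return the value table and greedy policy."""
    M, N = len(g_grid), len(z_grid)
    J = [[0.0] * N for _ in range(M)]
    policy = [[0] * N for _ in range(M)]
    for _ in range(iters):
        # w[m][src]: expected next-stage value when the z-transition starts at src
        w = [[sum(Pg[m][k] * Pz[src][l] * J[k][l]
                  for k in range(M) for l in range(N))
              for src in range(N)] for m in range(M)]
        new_J = [[0.0] * N for _ in range(M)]
        for m in range(M):
            for n in range(N):
                q0 = reward(g_grid[m], z_grid[n], 0, P, alpha) + beta * w[m][n]
                q1 = reward(g_grid[m], z_grid[n], 1, P, alpha) + beta * w[m][N - 1]
                new_J[m][n] = max(q0, q1)
                policy[m][n] = 1 if q1 > q0 else 0
        J = new_J
    return J, policy


if __name__ == "__main__":
    g_grid, z_grid, Pg, Pz = toy_model()
    _, policy = value_iteration(g_grid, z_grid, Pg, Pz, P=10.0, alpha=1.0)
    for g, row in zip(g_grid, policy):
        print(f"g = {g:.1f}: feedback decisions over increasing z -> {row}")
```

In this toy setting each printed row consists of ones followed by zeros, that is, feedback is enabled only for the smaller values of z.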

Lemma 2: $J_\beta^*$ has the following properties:

1) Given $\hat{g}$, $J_\beta^*(\hat{g}, \hat{z})$ monotonically increases with $\hat{z}$;

2) Define $w(m, n, \mu) := \sum_k \sum_\ell J_\beta^*(k, \ell)\, \tilde{P}_{m,k}\, P_{n,\ell}(\mu)$. Given m and $\mu$, $w(m, n, \mu)$ monotonically
increases with n;

3) Given m, $w(m, n, 1) \ge w(m, n, 0)$ for all n.

Proof: See Appendix B. □

Using the above lemma, the main result of this section is obtained as shown in the following theorem. 
Theorem 1: The optimal policy V* is of the threshold type. Specifically, there exists a function
$y : Q \to Z$ such that

$$V^* : \mu = \begin{cases} 0, & z \ge y(g) \\ 1, & \text{otherwise.} \end{cases} \qquad (17)$$

The function y(·) is bounded as

$$\frac{2^{-\alpha}(1 + Pg) - 1}{Pg} \le y(g) \le 1. \qquad (18)$$

Proof: See Appendix C. □
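
The lower bound in (18) is the value of z at which feedback and no feedback yield the same reward-per-stage, which a short numerical check makes explicit. In the sketch below, clamping the bound at zero (so that it stays within Z = [0, 1] when the feedback price is large) is an addition made for this illustration and is not part of (18); the function name is hypothetical.

```python
import math


def threshold_lower_bound(g, P, alpha):
    """Lower bound on the feedback threshold y(g), clamped to [0, 1].

    Solves log2(1 + P g z) = log2(1 + P g) - alpha for z.
    The clamp at zero is an assumption of this sketch.
    """
    if g <= 0.0:
        return 0.0
    bound = (2.0 ** (-alpha) * (1.0 + P * g) - 1.0) / (P * g)
    return max(0.0, bound)


if __name__ == "__main__":
    g, P, alpha = 2.0, 10.0, 1.0
    z0 = threshold_lower_bound(g, P, alpha)
    no_feedback = math.log2(1.0 + P * g * z0)
    feedback = math.log2(1.0 + P * g) - alpha
    print(z0, no_feedback, feedback)  # the two rewards coincide at z0
```

For z below this value, feedback already pays off within the current slot; the optimal threshold can only be larger because feedback may also increase rewards in later slots.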

Several remarks are in order. 

1) The reason that the optimal policy has the threshold structure is as follows. Small z corresponds
to outdated transmit CSI and vice versa. As a result, CSI feedback for small z yields significant
reward-per-stage but that for large z may result in negative reward-per-stage due to the feedback
cost. Therefore the optimal feedback policy should enable feedback only in the regime of small z,
resulting in the threshold-type policy. In addition, the feedback threshold on z depends on g since
the reward-per-stage is a function of g.
2) One would expect that larger g makes feedback more desirable because the resultant reward is also
larger (cf. (5)). In other words, the threshold function y should monotonically increase with g. This
is observed in simulation. However, proving this property requires an assumption on the
temporal correlation of g, which is otherwise unnecessary for this work.

3) The threshold lower bound in (18) corresponds to the feedback control policy that enables feedback 
whenever it gives larger reward-per-stage than no feedback. However, this policy is suboptimal 
because feedback may lead to extra reward in subsequent slots despite providing a smaller reward-per-stage
in the current slot than no feedback. Therefore the optimal policy should support more
frequent feedback than the above suboptimal one. This is the reason that the optimal threshold 
function is lower bounded as shown in (18). 

4) For a high transmit SNR ($P \to \infty$), the reward-per-stage in (5) can be approximated as

$$G(g,z,\mu) = \begin{cases} \log_2(1 + Pg) - a, & \mu = 1 \\ \log_2(1/z + Pg) + \log_2 z, & \text{otherwise} \end{cases} \tag{19}$$

$$\approx \begin{cases} \log_2(Pg) - a, & \mu = 1 \\ \log_2(Pg) + \log_2 z, & \text{otherwise.} \end{cases} \tag{20}$$

Define $\Delta G(g,z) := G(g,z,0) - G(g,z,1)$. From (20), $\Delta G(g,z) \approx \log_2 z + a$ for $P \to \infty$.
The optimal policy V* essentially depends only on $\Delta G(g,z)$ and the dynamics of z. Both factors
are independent of g for $P \to \infty$. Consequently, the optimal threshold on z is insensitive to the
variation of g at high SNRs. This is confirmed by simulation as discussed in Section VI.
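The high-SNR argument in Remark 4 can be checked numerically. The following is a minimal sketch, assuming the reward-per-stage takes the form written in (19); the function names and the parameter values are illustrative choices, not part of the original analysis.

```python
import math

def reward(P, g, z, mu, a):
    """Reward-per-stage as written in (19); feedback (mu = 1) pays the price a."""
    if mu == 1:
        return math.log2(1.0 + P * g) - a
    return math.log2(1.0 + P * g * z)

def delta_G(P, g, z, a):
    """Delta G(g, z) := G(g, z, 0) - G(g, z, 1)."""
    return reward(P, g, z, 0, a) - reward(P, g, z, 1, a)

def gap_to_high_snr_limit(P, g, z, a):
    """Distance between Delta G and its high-SNR approximation log2(z) + a."""
    return abs(delta_G(P, g, z, a) - (math.log2(z) + a))

if __name__ == "__main__":
    # The gap shrinks as P grows, for every (hypothetical) channel gain g.
    for P in (1e1, 1e3, 1e5):
        gaps = [gap_to_high_snr_limit(P, g, z=0.5, a=1.0) for g in (0.5, 1.0, 4.0)]
        print(P, gaps)
```

As P grows, the gap vanishes for every g, which is the sense in which the threshold on z becomes insensitive to g.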
For the extreme cases of zero and infinite feedback prices, the feedback threshold is specified in the 
following corollary. 

Corollary 1: The feedback threshold y is fixed at y = 0 for a = ∞ and y = 1 for a = 0.
Proof: See Appendix D. □

Note that y = 0 and y = 1 correspond to no feedback and feedback in every time slot, respectively. The
above results agree with the intuition that feedback is undesirable when the feedback price is too high 
but feedback should be performed persistently if it is free. 
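The lower bound in (18) and the two extreme feedback prices can be illustrated with a short sketch. It assumes the reward-per-stage of (19). Clipping the bound to [0, 1] is an added assumption, made because z lies in that interval, and the function names are hypothetical.

```python
import math

def threshold_lower_bound(P, g, a):
    """Lower bound on y(g) from (18), clipped to [0, 1] (clipping is an assumption)."""
    raw = (2.0 ** (-a) * (1.0 + P * g) - 1.0) / (P * g)
    return min(1.0, max(0.0, raw))

def myopic_decision(P, g, z, a):
    """Return 1 (feedback) iff feedback gives the larger reward-per-stage in this slot."""
    with_feedback = math.log2(1.0 + P * g) - a
    without_feedback = math.log2(1.0 + P * g * z)
    return 1 if with_feedback > without_feedback else 0

if __name__ == "__main__":
    for a in (0.0, 1.0, 50.0):
        print(a, threshold_lower_bound(P=100.0, g=1.0, a=a))
```

For a free feedback channel the bound equals one, and for a very large price it collapses to zero, in line with Corollary 1. For intermediate prices the optimal threshold is at least as large as this myopic value.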

Given the threshold function y defining V* (cf. Theorem 1), the maximum reward can be obtained as

$$J^* = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \pi_{m,n}\, G(g_m, z_n, \mu_{m,n}) \tag{21}$$

where

$$\mu_{m,n} = \begin{cases} 0, & z_n \ge y(g_m) \\ 1, & \text{otherwise} \end{cases} \tag{22}$$

and the stationary probabilities $\{\pi_{m,n}\}$ are obtained by solving the following linear equations [44]

$$\pi_{m,n} = \sum_{k,\ell} P_{m,k} P_{n,\ell}(\mu_{k,\ell})\, \pi_{k,\ell} \quad \text{and} \quad \sum_{m,n} \pi_{m,n} = 1. \tag{23}$$
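The evaluation of a threshold policy through its stationary distribution can be sketched as follows. This is an illustrative sketch rather than the authors' implementation. It assumes the reward-per-stage of (19), transition matrices whose columns are next-state distributions (entry [k, m] is the probability of moving from index m to index k), and a list `P_z = [P_z(0), P_z(1)]`. All names are hypothetical.

```python
import numpy as np

def threshold_policy(y, z):
    """mu[m, n] = 0 if z_n >= y(g_m), and 1 otherwise."""
    return (z[None, :] < y[:, None]).astype(int)

def joint_kernel(mu, P_g, P_z):
    """T[s_next, s] with s = m * N + n; each column is a next-state distribution."""
    M, N = mu.shape
    T = np.zeros((M * N, M * N))
    for m in range(M):
        for n in range(N):
            T[:, m * N + n] = np.outer(P_g[:, m], P_z[mu[m, n]][:, n]).ravel()
    return T

def average_reward(y, g, z, P_g, P_z, P, a):
    """Average reward of the threshold policy y and its stationary distribution."""
    mu = threshold_policy(y, z)
    T = joint_kernel(mu, P_g, P_z)
    S = T.shape[0]
    A = T - np.eye(S)
    A[-1, :] = 1.0            # replace one balance equation by the normalization
    b = np.zeros(S)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b).reshape(mu.shape)
    G = np.where(mu == 1,
                 np.log2(1.0 + P * g)[:, None] - a,
                 np.log2(1.0 + P * np.outer(g, z)))
    return float((pi * G).sum()), pi
```

Replacing one balance equation by the normalization constraint gives a square system with a unique solution whenever the controlled chain has a single recurrent class.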

Finally, we discuss the computation of V*. Let $\mathbf{y}$ denote the M × 1 threshold vector with $[\mathbf{y}]_m = y(g_m)$.
Then $\mathbf{y}$ determining V* can be computed either by an exhaustive search or by policy iteration [29].
Using Theorem 1 and (21), the brute-force approach is specified by $\mathbf{y} = \arg\max_{\mathbf{x} \in \mathcal{V}} J^*(\mathbf{x})$, where
$\mathcal{V} = \left\{ \mathbf{x} \in \mathcal{Z}^M \mid [\mathbf{x}]_m \ge \frac{2^{-a}(1 + Pg_m) - 1}{Pg_m}\ \forall\, m \right\}$. For this approach, the threshold structure of V* is exploited
to reduce the complexity of the exhaustive search from $O(2^{MN})$ to $O(|\mathcal{V}|) = O(N^M)$. However,
this complexity is still too high if M and N are large. For this case, a more practical approach for
computing $\mathbf{y}$ is policy iteration [29]. Each iteration comprises two steps, namely policy evaluation and
policy improvement. The policy evaluation in the ith iteration is to evaluate a given policy by
computing the average reward and a set of parameters $\{\Delta_{m,n}\}$ called differential rewards as follows [29]

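$$J^{(i)} + \Delta^{(i)}_{m,n} = G(g_m, z_n, \mu^{(i)}_{m,n}) + \sum_{k,\ell} \Delta^{(i)}_{k,\ell} P_{k,m} P_{\ell,n}(\mu^{(i)}_{m,n}), \quad \forall\, m, n \tag{24}$$

$$\Delta^{(i)}_{M-1,N-1} = 0 \tag{25}$$

where $J^{(i)}$ is the average reward of the policy in the ith iteration and $\mu^{(i)}_{m,n}$ denotes the decision for the
state $(g, z) = (g_m, z_n)$. The subsequent policy improvement is specified by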



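$$\mu^{(i+1)}_{m,n} = \arg\max_{x \in \{0,1\}} \left[ G(g_m, z_n, x) + \sum_{k,\ell} \Delta^{(i)}_{k,\ell} P_{k,m} P_{\ell,n}(x) \right], \quad \forall\, m, n. \tag{26}$$

The policy iteration terminates if $\mu^{(i+1)}_{m,n} = \mu^{(i)}_{m,n}$ ∀ m, n. For the simulation in Section VI, the policy
iteration typically converges within several iterations.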
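The evaluation and improvement steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the reward-per-stage follows (19), the transition matrices are column-stochastic (entry [k, m] is the probability of moving from index m to index k), `P_z = [P_z(0), P_z(1)]`, and the tie-breaking rule that keeps the current decision is an added choice to avoid cycling.

```python
import numpy as np

def reward(g, z, P, a):
    """G[x, m, n]: reward-per-stage for decision x in state (g_m, z_n)."""
    G0 = np.log2(1.0 + P * np.outer(g, z))
    G1 = np.repeat((np.log2(1.0 + P * g) - a)[:, None], len(z), axis=1)
    return np.stack([G0, G1])

def next_state_dist(m, n, x, P_g, P_z):
    """Distribution of the next state (k, l) given state (m, n) and decision x."""
    return np.outer(P_g[:, m], P_z[x][:, n]).ravel()

def evaluate(mu, G, P_g, P_z):
    """Policy evaluation: solve for the average reward J and differential rewards."""
    M, N = mu.shape
    S = M * N
    A = np.zeros((S + 1, S + 1))
    b = np.zeros(S + 1)
    for m in range(M):
        for n in range(N):
            s = m * N + n
            A[s, :S] = -next_state_dist(m, n, mu[m, n], P_g, P_z)
            A[s, s] += 1.0
            A[s, S] = 1.0                 # coefficient of the average reward J
            b[s] = G[mu[m, n], m, n]
    A[S, S - 1] = 1.0                     # normalization: last differential reward is 0
    sol = np.linalg.solve(A, b)
    return sol[S], sol[:S]

def policy_iteration(g, z, P_g, P_z, P, a, max_iter=100):
    G = reward(g, z, P, a)
    M, N = len(g), len(z)
    mu = np.zeros((M, N), dtype=int)      # start from the no-feedback policy
    for _ in range(max_iter):
        _, delta = evaluate(mu, G, P_g, P_z)
        new_mu = mu.copy()
        for m in range(M):
            for n in range(N):
                q = [G[x, m, n] + delta @ next_state_dist(m, n, x, P_g, P_z)
                     for x in (0, 1)]
                # switch only on a strict improvement; ties keep the current decision
                if q[1 - mu[m, n]] > q[mu[m, n]] + 1e-12:
                    new_mu[m, n] = 1 - mu[m, n]
        if np.array_equal(new_mu, mu):
            break
        mu = new_mu
    J, _ = evaluate(mu, G, P_g, P_z)
    return mu, J
```

The threshold vector can then be read off the returned decision matrix row by row, as the smallest grid value of z at which the decision switches to no feedback.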

C. Policy for Continuous State Space 

In this section, we consider the case where (g, z) is directly used as the controller input and design
the controller by solving $A(\mathcal{X}, \mathcal{U}, f_x, G)$. The resultant optimal feedback control policy V* is proved to
be of the threshold type. Specifically, we show that the threshold structure of V* as given in Theorem 1
holds in the limit of high quantization resolution ($M \to \infty$ and $N \to \infty$).

The proof of this result uses results from [43], which addresses the validity of approximately solving a
discounted-reward (or discounted-cost) problem with a continuous state space by quantizing the space
and using DP. A key result in [43] states that the approximate solution converges to the continuous-space
counterpart as the space-quantization error reduces to zero. This requires that the reward-per-stage and
the state transition kernel are Lipschitz continuous. To state this result mathematically, some notation
is introduced. Consider an infinite-horizon discounted reward problem with a compact state space $\mathcal{X}'$
and a finite control space $\mathcal{U}'$. Let $x' \in \mathcal{X}'$ denote the state with a transition PDF $f(x'_{t+1} \mid x'_t, \mu'_t)$
for $\mu'_t \in \mathcal{U}'$. Given a set of grid points $\hat{\mathcal{X}}'$ in $\mathcal{X}'$, $\hat{x}' \in \hat{\mathcal{X}}'$ results from quantizing x', namely
$\hat{x}' = Q(x') := \arg\min_{a \in \hat{\mathcal{X}}'} \|x' - a\|$. Let the matching state transition kernel be represented by $P'_x$. Define
the maximum quantization error as $d_s := \max_{\hat{x}' \in \hat{\mathcal{X}}'} \max_{x' : Q(x') = \hat{x}'} \|\hat{x}' - x'\|$. Let $E_\beta$ and $\hat{E}_\beta$ denote the
discounted rewards obtained by solving $V(\mathcal{X}', \mathcal{U}', f, G)$ and $V(\hat{\mathcal{X}}', \mathcal{U}', P'_x, G)$, respectively, where G is
a reward-per-stage function. A key result in [43] is stated in the following lemma.

Lemma 3 ([43]): Assume the reward-per-stage function G and f satisfy the following Lipschitz conditions

$$|G(a, \mu) - G(b, \mu)| \le V \|a - b\|$$

$$\|f(x'_{t+1} \mid x'_t = a, \mu_t) - f(x'_{t+1} \mid x'_t = b, \mu_t)\| \le W \|a - b\|$$

where $a, b \in \mathcal{X}'$, and V and W are positive constants. Then

$$\lim_{d_s \to 0} \sup_{x' \in \mathcal{X}'} \left| E_\beta(x') - \hat{E}_\beta(Q(x')) \right| = 0. \tag{27}$$
Lemma 3 cannot be directly applied to extend the threshold structure of V* in Theorem 1 to the
continuous-space counterpart V*. The reason is that the continuous state space $\mathcal{X}$ is unbounded and thus
not compact. As a result, it is not guaranteed that $d_s \to 0$ for $M, N \to \infty$, which, however, is required
for the convergence in (27).

The main result of this section is given in the following proposition. To overcome the mentioned
difficulty in directly applying Lemma 3, the proof of Proposition 1 uses a dummy stochastic optimization
problem with a bounded and continuous state space. The average reward of this problem is shown to
converge to that of the target problem with an unbounded state space as the quantization resolution
increases, proving the desired result.

Proposition 1: If the transition PDFs $f(g_{t+1} \mid g_t)$ and $f(z_{t+1} \mid z_t)$ are Lipschitz continuous, the
optimal policy V* is of the threshold type. Specifically, there exists a function $y : \mathcal{G} \to \mathcal{Z}$ such that

$$V^* : \mu = \begin{cases} 0, & z \ge y(g) \\ 1, & \text{otherwise.} \end{cases} \tag{28}$$

Moreover, y(·) is bounded as

$$\frac{2^{-a}(1 + Pg) - 1}{Pg} \le y(g) \le 1. \tag{29}$$

Proof: See Appendix E. □ 
We offer the following remarks.

1) The feedback probability $\Pr(\mu = 1)$ is strictly larger than zero based on the following argument.
For an arbitrary value of z, there exists x > 0 such that the rate function $\log_2(1 + Pg) - a >
\log_2(1 + Pgz)$ ∀ g > x, corresponding to $\mu = 1$ (cf. (5) and Lemma 2). Since g follows the
chi-square distribution, $\Pr(g > x) > 0$ and thus $\Pr(\mu = 1) > 0$. This justifies the above claim.

2) The continuous-space policy V* cannot be directly computed using DP but can be approximated 
by interpolating the discrete-space counterpart V* in Theorem 1 [29]. The approximation accuracy 
improves with increasing quantization resolution specified by M and N at the cost of rapidly 
growing computation complexity. 

V. Feedback Control Policy: Finite-Rate Feedback Channel 

A perfect feedback channel is assumed for the analysis in the preceding section. In this section, we
consider a finite-rate feedback channel. The optimal feedback control policy is shown to remain of
the threshold type. The maximum average reward for finite-rate feedback is shown to be equal to that
for perfect feedback at an increased feedback price. Feedback control considered in this section has
the discrete state space X as defined in Section IV-A. The results in this section can be extended
straightforwardly to feedback control with the continuous state space X following the approach in
Section IV-C. The details are omitted for brevity.

Consider the average reward problem A(X,U, P x P e , G e ) that approximates A(X,U, /J, G e ) formu- 
lated in Section III, where G e is in (9) and 

[P e (l)]m = J_ E[/(*t+i =r\zt = e))dT (30) 

and P e (0) = P(0). As specified in the following lemma, P e and G e are observed from (30) to have the 
same properties as their counterparts for the case of perfect feedback considered in Section IV. 
Lemma 4: 

1) For x = (g, z), G e (x,fi) monotonically increases with z; G e (x, 1) is independent of z. 

2) P e is monotone and has identical columns. 

Using Lemma 4, we have the following corollary of Theorem 1 . 

Corollary 2: For quantized feedback, the optimal feedback control policy V* resulting from solving 
A(X,U, P x P e ,G e ) is of the same threshold type as specified in Theorem 1. 

Given feedback inaccuracy due to quantization, feedback may not always be desirable even if it is free
(a = 0). Thus the second claim in Corollary 1 does not hold for quantized feedback, as confirmed by
simulation (cf. Fig. 7).

It can be observed from (9) and (30) that quantized feedback affects both the reward-per-stage and the
dynamics of z. The joint effects on the maximum average reward cannot be characterized using simple
expressions. To provide insight into these effects, they are analyzed separately. For this purpose, define
the function J*(A, B) that gives the maximum average reward of A(X,U, P x B, A). Then the effects
of feedback quantization are specified in the following proposition.

Proposition 2: The function J*(·, ·) satisfies the following inequalities:

1) $J^*(G, P) \ge J^*(G, P_\epsilon) \ge J^*(G_\epsilon, P_\epsilon)$

2) $J^*(G, P) \ge J^*(G_\epsilon, P) \ge J^*(G_\epsilon, P_\epsilon)$

3) $J^*(G_\epsilon, P, a) \ge J^*(G, P, a - E[\log_2 \epsilon])$

4) $J^*(G_\epsilon, P_\epsilon, a) \ge J^*(G, P_\epsilon, a - E[\log_2 \epsilon])$.

Proof: See Appendix F. □ 

The inequalities in 1) and 2) state that both effects of finite-rate feedback, on the reward-per-stage and
on the dynamics of z, reduce the maximum average reward with respect to perfect feedback. As implied by the
inequalities in 3) and 4), the reward reduction due to feedback quantization is no larger than that caused
by increasing the feedback price by $-E[\log_2 \epsilon]$, which is nonnegative since $\epsilon \le 1$. For the specific
distribution of $\epsilon$ in Footnote 9 and $|\mathcal{F}| \gg 1$, this quantity can be approximated as [40], [42]

E[log 2 e] «log 2 ex (1-E[e]) <log 2 ex |jr|-M^T. (31) 
Thus for $|\mathcal{F}| \to \infty$, $E[\log_2 \epsilon] \to 0$ and the equalities in 3) and 4) of Proposition 2 hold.

VI. Simulation Results 

In this section, additional insight into optimal feedback control is obtained from simulation results.
In the simulation, the channel model follows Assumption 2. The temporal correlation of the channel is
specified by Clarke's function [45]. The state space for feedback control is quantized as discussed in
Section IV-A with M = N = 16. The transmit SNR is 20 dB.

In Fig. 3, the curves of net throughput versus feedback price are plotted for both the optimally
controlled feedback and the conventional periodic feedback. The net throughput for controlled and periodic
feedback is maximized by value iteration [29] and a numerical search over different feedback intervals,
respectively. The Doppler frequency is $f_D = \{0.1, 0.01\}/T_c$ and the number of transmit antennas is L = 3.
As observed from Fig. 3, the throughput for all cases decreases with increasing feedback price. For
high feedback prices, the curves flatten with net throughput fixed at 5.9 bit/s/Hz, corresponding to no
feedback. Subtracting this value from net throughput gives the feedback gain as indicated in Fig. 3.
Controlled feedback is observed to increase the net throughput of periodic feedback by up to 0.5 bit/s/Hz,
or 24% of the feedback gain of about 2.1 bit/s/Hz. The increment in net throughput is insensitive to the
change in Doppler frequency. Finally, for small feedback prices (a < 0.15), both feedback algorithms
perform feedback in every slot and thus all curves in Fig. 3 overlap in this range.

Fig. 3. Net throughput versus feedback price for both the optimally controlled and periodic feedback over the perfect feedback
channel. The Doppler frequency is $f_D = \{0.1, 0.01\}/T_c$ and the number of transmit antennas is L = 3.

The comparison in Fig. 3 continues in Fig. 4 but for different numbers of transmit antennas L = {3, 4}.
It is observed that the maximum net throughput gain for controlled feedback over the periodic feedback
is about 0.5 bit/s/Hz for both L = 3 and L = 4. Thus this gain is insensitive to the change in L.

Refer to Footnote 8. The mentioned function of maximum throughput versus average feedback rate
(normalized for a = 1) is plotted in Fig. 5 for $f_D = \{0.1, 0.01\}/T_c$ and L = 3. Also plotted is the
matching curve for periodic feedback obtained by a numerical search over different feedback intervals.
As observed from the figure, for the same average feedback rate, optimal controlled feedback provides
up to 0.5 bit/s/Hz higher throughput than periodic feedback. Alternatively, given identical throughput,
the former can reduce the feedback cost by half with respect to the latter (cf. throughput = 7 bit/s/Hz
and $f_D T_c = 0.01$).

Fig. 6 displays the curves of net throughput versus feedback price for the perfect and finite-rate
(quantized) feedback channels. The Doppler frequency is $f_D = 0.1/T_c$ and the number of transmit
antennas is L = 3. The codebook used for quantizing feedback CSI has the size of $|\mathcal{F}| = 16$ and is
constructed using Lloyd's algorithm [8], [9]. As observed from Fig. 6, feedback quantization reduces net
throughput slightly. This loss is larger for smaller a (more frequent feedback) and vice versa.

Fig. 4. Net throughput versus feedback price for both the optimally controlled and periodic feedback over the perfect feedback
channel. The Doppler frequency is $f_D = 0.1/T_c$ and the number of transmit antennas is L = {3, 4}.

Fig. 5. Throughput versus normalized average feedback rate for the optimally controlled and periodic feedback over the perfect
feedback channel. The Doppler frequency is $f_D = \{0.1, 0.01\}/T_c$ and the number of transmit antennas is L = 3.

Fig. 6. Net throughput versus feedback price for the optimally controlled feedback over the perfect and quantized feedback
channels. The Doppler frequency is $f_D = 0.1/T_c$; the number of transmit antennas is L = 3; the codebook used for quantizing
feedback CSI has the size of $|\mathcal{F}| = 16$.

The optimal control policies computed in simulation using policy iteration [29] have the same threshold
structure as predicted by analysis in the preceding sections. This also validates Assumption 3 on the
channel temporal correlation. The thresholds for these policies are observed to be insensitive to the
variation of channel gain g (cf. Remark 4 on Theorem 1). For this reason, the feedback threshold
on z is averaged over the range of g and plotted against the feedback price a in Fig. 7, where both
perfect and finite-rate feedback channels are considered. The simulation parameters follow those for
Fig. 6. As observed from Fig. 7, the average feedback threshold for the perfect feedback channel is 1
for a = 0, corresponding to feedback for every time slot; the threshold converges to zero as a increases.
Fig. 7 shows that feedback quantization reduces the feedback threshold slightly, implying less frequent
feedback. Moreover, for a = 0, the average feedback threshold for quantized feedback is smaller than
one, agreeing with the remark on Corollary 2. Last, note that the humps on the curves in Fig. 7 are
caused by quantizing the controller's state space.

VII. Conclusion 

In this paper, we have proposed the approach of controlling feedback for maximizing net throughput of
transmit beamforming systems. The optimal control policy has been proved to be of the threshold type.
Under this policy, feedback is performed when the angle between the transmit beamformer and the channel
exceeds a threshold, which varies with the channel power. The threshold-type optimal policy has been
shown to apply to both quantized and continuous controller inputs and both perfect and finite-rate feedback
channels. Feedback quantization has been found to decrease net throughput similarly to an increase in the
feedback price. As observed from simulation results, the optimal feedback control contributes significant
net throughput gains without requiring additional bandwidth or antennas.

Fig. 7. Average threshold on z versus feedback price a for the optimally controlled feedback over the perfect and quantized
feedback channels. The Doppler frequency is $f_D = 0.1/T_c$; the number of transmit antennas is L = 3; the codebook used for
quantizing feedback CSI has the size of $|\mathcal{F}| = 16$.

The work opens several issues for future investigation. First, the closed-form expression of the op- 
timal feedback control policy can be derived by making additional assumptions on channel statistics. 
This allows direct policy computation rather than using the more complicated policy iteration method. 
Second, the controlled feedback approach can be extended to other types of multi-antenna systems with 
feedback such as precoded spatial multiplexing or multiuser MIMO. Feedback in these systems supports 
multiple operations such as spatial multiplexing, interference avoidance, and scheduling. As a result, 
the computation and analysis of optimal control policies are more challenging than those for single-user 
transmit beamforming considered in this paper. Last, considering bursty data makes it necessary to jointly 
control the forward-link queue and CSI feedback. Addressing this issue by extending the approach in [46] 
can establish an optimal tradeoff relation between feedback overhead, transmission power and queueing 
delay.

Appendix

A. Proof of Lemma 1 

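Let K denote the height of $\mathbf{v}$ and $\mathbf{A}$. Define $\Delta v_\ell := [\mathbf{v}]_\ell - [\mathbf{v}]_{\ell-1}$ with $[\mathbf{v}]_{-1} := 0$. Note that $\Delta v_\ell \ge 0$
for all $\ell$ since $\mathbf{v}$ is monotone. Using the above definition, we can write

$$[\mathbf{A}^\dagger \mathbf{v}]_n = \sum_{\ell=0}^{K-1} [\mathbf{v}]_\ell A_{\ell,n} = \sum_{r=0}^{K-1} \sum_{\ell=r}^{K-1} \Delta v_r A_{\ell,n}. \tag{32}$$

It follows that for $n_1 > n_2$

$$[\mathbf{A}^\dagger \mathbf{v}]_{n_1} - [\mathbf{A}^\dagger \mathbf{v}]_{n_2} = \sum_{r=0}^{K-1} \Delta v_r \left( \sum_{\ell=r}^{K-1} A_{\ell,n_1} - \sum_{\ell=r}^{K-1} A_{\ell,n_2} \right) \overset{(a)}{\ge} 0$$

where (a) follows from the monotonicity of $\mathbf{A}$ and that $\Delta v_r \ge 0$ ∀ r. This completes the proof of 1).
For $n_1 > n_2$, the difference between the $n_1$th and $n_2$th elements of the kth row of $\mathbf{BC}$ is

$$\sum_{\ell} [\mathbf{B}]_{k,\ell} \left( [\mathbf{C}]_{\ell,n_1} - [\mathbf{C}]_{\ell,n_2} \right) \overset{(b)}{\ge} 0 \tag{33}$$

where (b) follows from the monotonicity of each row of $\mathbf{C}$ and that the elements of $\mathbf{B}$ are nonnegative.
Then 2) follows from the above inequality.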

B. Proof of Lemma 2 

Consider fixed m, $n_1$ and $n_2$ with $n_1 > n_2$ and a nonnegative function q(m, n) that increases
monotonically with n. For instance, the all-zero function is a suitable choice. Based on the value iteration
in (16), to prove the lemma, it is sufficient to show that (Fq)(m, n) is also a monotonically increasing
function of n given m. Define $\mu'$ by

$$(Fq)(m, n_2) = G(g_m, z_{n_2}, \mu') + \beta \sum_{k,\ell} q(k,\ell) P_{k,m} P_{\ell,n_2}(\mu').$$

Define the matrix $\mathbf{Q}$ with $[\mathbf{Q}]_{k,\ell} = q(k,\ell)$. The above equation can be rewritten as

$$(Fq)(m, n_2) = G(g_m, z_{n_2}, \mu') + \beta [\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(\mu')]_{m,n_2}. \tag{34}$$

Assume $\mu' = 1$. Then from (14) and (34)

$$(Fq)(m, n_1) - (Fq)(m, n_2) \ge G(g_m, z_{n_1}, 1) + \beta [\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(1)]_{m,n_1} - G(g_m, z_{n_2}, 1) - \beta [\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(1)]_{m,n_2} \overset{(a)}{=} 0$$

where (a) follows from that both $G(g_m, z_n, 1)$ and $P_{\ell,n}(1)$ are independent of n. Next, assume $\mu' = 0$.
It follows that

$$(Fq)(m, n_1) - (Fq)(m, n_2) \ge G(g_m, z_{n_1}, 0) + \beta [\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(0)]_{m,n_1} - G(g_m, z_{n_2}, 0) - \beta [\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(0)]_{m,n_2}$$

$$\overset{(b)}{\ge} \beta [\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(0)]_{m,n_1} - \beta [\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(0)]_{m,n_2} \overset{(c)}{\ge} 0 \tag{35}$$

where (b) holds since $G(g_m, z_n, \mu')$ is a monotonically increasing function of $z_n$, and (c) is due to that the
matrix $\mathbf{P}^\dagger \mathbf{Q} \mathbf{P}(0)$ has monotone rows, which results from Lemma 1 and that $\mathbf{P}(0)$ is monotone and $\mathbf{Q}$
comprises monotone rows. Combining the above results proves the monotonicity of Fq and hence $J_\beta^*$.

Proving the monotonicity of $\sum_{k,\ell} J_\beta^*(k,\ell) P_{k,m} P_{\ell,n}$ uses that of $J_\beta^*$ as shown above. The proof
procedure is similar to the above steps and thus omitted.

C. Proof of Theorem 1 

Consider the optimal policy for maximizing the discounted reward. To simplify notation, define

$$\Delta J(m,n) := \log_2(1 + Pg_m) - \log_2(1 + Pg_m z_n) - a + \sum_{k,\ell} J_\beta^*(k,\ell) P_{k,m} P_{\ell,n}(1) - \sum_{k,\ell} J_\beta^*(k,\ell) P_{k,m} P_{\ell,n}(0). \tag{36}$$

From Bellman's equation $J_\beta^* = F J_\beta^*$ with F defined in (14), $V_\beta^*(g_m, z_n) = 1$ if $\Delta J(m,n) > 0$ or otherwise
$V_\beta^*(g_m, z_n) = 0$. Consider $(m_0, n_0)$ such that $\Delta J(m_0, n_0) \le 0$. For any n with $n_0 \le n \le N - 1$, given
that $P_{\ell,n}(1)$ is independent of n, it follows from (36) and Lemma 2 that $\Delta J(m_0, n) \le 0$. Therefore for
each $g \in \mathcal{G}$, there exists the matching $\eta \in \mathcal{Z}$ such that $V_\beta^*(g, z) = 0$ ∀ $z \ge \eta$ and $V_\beta^*(g, z) = 1$ ∀ $z < \eta$
if $\eta > z_0$. Defining y as the mapping from g to $\eta$ proves that the optimal policy is of the threshold
type with y being the threshold function.

Next, the bounds in (18) are proved as follows. The upper bound is trivial given that $0 \le z \le 1$. By
the above definition, y(g) can be written as

$$y(g) = \min_{z \in \mathcal{Z}} z \quad \text{s.t.} \quad \Delta J(g, z) \le 0 \tag{37}$$

where by abuse of notation $\Delta J(g_m, z_n) := \Delta J(m,n)$. From (36) and Lemma 2

$$\Delta J(m,n) \ge \underbrace{\log_2(1 + Pg_m) - \log_2(1 + Pg_m z_n) - a}_{\Delta J^-(m,n)}. \tag{38}$$

It follows from (36) and Lemma 2 that $\Delta J(m,n)$ is a monotonically decreasing function of n. So is
$\Delta J^-(m,n)$ from its definition. Therefore, from (37) and (38)

$$y(g) \ge \gamma \tag{39}$$

where

$$\gamma = \min z \quad \text{s.t.} \quad \Delta J^-(d(g), d(z)) \le 0. \tag{40}$$

Using (39) and solving for $\gamma$ using (40) proves that the lower bound in (18) holds for $V_\beta^*$.

Given $0 < \beta_0 < 1$, the properties of $V_\beta^*$ as proved above hold for any $\beta \in [\beta_0, 1)$. These properties
must also hold for the optimal policy giving $J^* = \lim_{\beta \to 1} (1 - \beta) J_\beta^*$ [29]. This completes the proof.

D. Proof of Corollary 1 

The first claim is obviously valid since for a = ∞ any feedback instant causes net throughput to be
−∞ and thus the optimal feedback controller should block feedback by using the threshold y = 0. The
second claim holds since for a = 0, $G(g, z, 0) \le G(g, z, 1)$ ∀ $(g, z) \in \mathcal{X}$. Thus feedback should be
performed in every time slot, corresponding to fixed y = 1.

E. Proof of Proposition 1

A stationary feedback policy partitions the continuous state space $\mathcal{X}$ into two sets $W$ and $W^c$ such
that $\mu = 1$ ∀ $(g, z) \in W$ and $\mu = 0$ ∀ $(g, z) \in W^c$. To simplify notation, define $W(g) := W \cap (\{g\} \times \mathcal{Z})$
and $W^c(g) := (\{g\} \times \mathcal{Z}) \setminus W(g)$. Moreover, let $f(z \mid W)$ denote the PDF of z that depends on the set (policy) W.
Note that the PDF f(g) of g is independent of W.

To apply Lemma 3, we design a genie-aided dummy feedback-control system similar to the current
one but with a bounded continuous state space. In the virtual system, the encoder is shut down by the
genie whenever $g > g_{M-1}$ and is otherwise turned on. The average reward for this system is $I := J(G')$,
where the reward-per-stage G' is defined in terms of G in (5) as

$$G'(g, z, \mu) = \begin{cases} G(g, z, \mu), & g \le g_{M-1} \\ 0, & \text{otherwise.} \end{cases} \tag{41}$$



The maximum reward I* can be written as

$$I^*(M) = \max_{W \subseteq \mathcal{G} \times \mathcal{Z}} \int_0^{g_{M-1}} \left\{ \int_{W(g)} G'(g,z,1) f(z \mid W)\,dz + \int_{W^c(g)} G'(g,z,0) f(z \mid W)\,dz \right\} f(g)\,dg. \tag{42}$$

Next, the reward I*(M) is shown to converge to J* as M increases. Similar to (42),

oo 1 

r = w ™* z f< J G'(g,z,l)f(z\W)dz+ J G'(g,z,0)f(z\W)dz\f(g)dg 



W(g) 



< max 

wegxz 



9M-1 I i 

J | j G'(g,z,l)f(z\W)dz + J G'(g,z,0)f(z\W)dz\f(g)dg + 



\W(g) 
( 



W<=( 5 ) 



max 

w&gxz 



9 m -i 



J < J G'(g,z,l)f(z\W)dz+ J G'(g,z,0)f(z\W)dz[f(g)dg 



oo 

< r(M) + ^max^y | log 2 (l + Pg) J f(z | W)dz + log 2 (l + Pg) j f(z \ W)dz 

§M-1 \ 1 

OO 

= P{M)+ J log 2 (l + Pg)f{g)dg 



W(fl) 



W«(fl) 



(a) 

< J*(M) + 



9m -i 

E[log 2 (l + Pg)] 



(43) 



where (a) is obtained by applying Markov's inequality. Given that g follows chi-square distribution and 
Pr(<? > gM-i) = 5m-i — oo for M — > oo. Therefore, since J* and E[log 2 (l + Pg)] are finite, it 

follows from (43) that

$$\lim_{M \to \infty} |J^* - I^*(M)| = 0. \tag{44}$$

Next, consider the approximate feedback control optimization for the virtual system with a discrete state 
space. This space, denoted as S, is obtained using a quantization algorithm similar to that in Section IV- 
A, hence S = (G\{g~M-i}) x Let It and I* denote the maximum discounted and average rewards, 
respectively. For the above approximated problem, the maximum quantization error is given as 

d s = max max max max \J\q — q m \ 2 + \z — z n \ 2 . (45) 

0<m<M-2 0<n<7V-l g m <g<g m+ i z n <z<z„ +1 

Since d s —> as M, N —> oo and using the Lipschitz continuity of the conditional PDF's of (g, z) and 
the reward-per-stage function, it follows from Lemma 3 that 

lim \I*(M, N) - 7*| = lim lim (1 - f3)\I^g, z, M, N) - I* p (g, z)\ = (46) 

M,N— »oo 13— »1 N-*oo H ' 

From (44) and (46) and the triangle inequality

lim \J*-f*(N)\< lim (\J* — I*(M) \ + \I* — I*(M,N)\] = 0. (47) 

M,N^oo M,N—>oo \ J 

Furthermore, for $M, N \to \infty$, the results in Theorem 1 hold for the virtual system with the state space
S. This completes the proof.

F. Proof of Proposition 2

Proof of the inequalities in 1) and 2): The second inequality in 1) holds since $G \ge G_\epsilon$ from their
definitions in (5) and (9). In the sequel, we prove the first inequality in 1) based on value iteration [29].
To this end, consider two nonnegative functions $q_1(g,z)$ and $q_2(g,z)$ that have the support $\mathcal{G} \times \mathcal{Z}$ and
monotonically increase with z. Furthermore, $q_1(g,z) \ge q_2(g,z)$ ∀ $(g,z) \in \mathcal{G} \times \mathcal{Z}$, which is represented
by $q_1 \ge q_2$ for simplicity. Following a procedure similar to that in the proof of Lemma 2, it can be shown
that the functions $F(G, P)q_1$ and $F(G, P_\epsilon)q_2$ both monotonically increase with z, where F is in (14).

Next, it is shown that $F(G, P)q_1 \ge F(G, P_\epsilon)q_2$. Let $\mu_a$ and $\mu_b$ denote the control decisions that satisfy

$$[F(G,P)q_1](g_m, z_n) = G(g_m, z_n, \mu_a) + \beta \sum_{k,\ell} q_1(g_k, z_\ell) [P]_{k,m} [P(\mu_a)]_{\ell,n} \tag{48}$$

$$[F(G,P_\epsilon)q_2](g_m, z_n) = G(g_m, z_n, \mu_b) + \beta \sum_{k,\ell} q_2(g_k, z_\ell) [P]_{k,m} [P_\epsilon(\mu_b)]_{\ell,n} \tag{49}$$

If H a = H b = 1, 

[F(G,P)gi - F{G,P e )q 2 ]{g m ,z n ) = log 2 (l + Pg m ) + ^k, ^)[P]fc,m[P(l)k,n 

k,e 

-E[log 2 (l + Pg m e)] - ^)[P] fe , m [P e (l)]^ 

k,£ 

> ^2qi(g~k,ze)[P]k,m[V(l)]e,n - ^^fe^^Pl^mPtt 1 )]^ 

ft j£ 



> ^2[P]k, m q2(g k , Z() {[P(l)h, n - [Pe(l)Un} • 

M 

where (a) follows from qi > q 2 . For < £q < N — 1 and from (12) and (30) 

JV-l N-l -i -i 

^[P(l)] £ , n -^[P e (l)],,„ = / f(z\z! = l)dz- E[f(z\z' = e)]dz 

£—£ a l—Q„ J Z—Zt n J Z — Z{ n 



(50) 



= E 

€ 

(b) 

> 0. 



J Z=S<>„ 



(z | z' = l)<i§ — / f(z | z' = e)<i§ 

' z=ze n J z=ze n 



(51) 



where (b) is due to Assumption 3. Using (50) and (51) and following steps similar to those in the proof
of Lemma 2 shows that $F(G, P)q_1 \ge F(G, P_\epsilon)q_2$ if $\mu_a = \mu_b = 1$. If $\mu_a = \mu_b = 0$, since $P_\epsilon(0) = P(0)$
and $q_1 \ge q_2$,

$$[F(G,P)q_1 - F(G,P_\epsilon)q_2](g_m, z_n) = \beta \sum_{k,\ell} [q_1(g_k, z_\ell) - q_2(g_k, z_\ell)] [P]_{k,m} [P(0)]_{\ell,n} \ge 0. \tag{52}$$

From (14), the values of $[F(G, P)q_1 - F(G, P_\epsilon)q_2]$ for $(\mu_a = 1, \mu_b = 0)$ and $(\mu_a = 0, \mu_b = 1)$ are no smaller
than those for $(\mu_a = \mu_b = 0)$ and $(\mu_a = \mu_b = 1)$, respectively. Combining the above results shows that
$F(G,P)q_1 \ge F(G,P_\epsilon)q_2$.
Consequently, $J_\beta^*(G, P) \ge J_\beta^*(G, P_\epsilon)$ since by value iteration

$$J_\beta^*(G,P) = \lim_{n \to \infty} F^n(G,P) q_1 \quad \text{and} \quad J_\beta^*(G,P_\epsilon) = \lim_{n \to \infty} F^n(G,P_\epsilon) q_2. \tag{53}$$

As $J^* = \lim_{\beta \to 1} (1 - \beta) J_\beta^*$, the first inequality in 1) of the proposition statement is proved.

The inequalities in 2) of the proposition statement can also be proved using the above procedure.

Proof of the inequalities in 3) and 4): The inequalities in 3) and 4) can be proved using similar
procedures. Thus we focus on proving that in 3). The reward-per-stage function in (9) can be lower bounded
as follows, where the fourth argument is the feedback price:

$$G_\epsilon(g,z,1,a) = E[\log_2(1/\epsilon + Pg)] + E[\log_2 \epsilon] - a \overset{(a)}{\ge} \log_2(1 + Pg) + E[\log_2 \epsilon] - a = G(g,z,1,a - E[\log_2 \epsilon]) \tag{54}$$

where (a) uses $\epsilon \le 1$. Combining (54) and $G_\epsilon(g,z,0) = G(g,z,0)$ gives the desired result.
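The bound in (54) can be checked numerically with the sketch below. It is an illustration under assumptions: the quantized-feedback reward is taken as $E[\log_2(1 + Pg\epsilon)] - a$, which equals the first line of (54), and the samples of $\epsilon$ are a hypothetical stand-in for the distribution in Footnote 9, which is not reproduced here.

```python
import numpy as np

def quantized_feedback_reward(P, g, eps, a):
    """E[log2(1 + P g eps)] - a, estimated from samples eps in (0, 1]."""
    return float(np.mean(np.log2(1.0 + P * g * eps))) - a

def perfect_feedback_reward(P, g, a):
    """log2(1 + P g) - a, the feedback reward over the perfect channel."""
    return float(np.log2(1.0 + P * g)) - a

def price_increment(eps):
    """-E[log2 eps], the nonnegative price increment appearing in (54)."""
    return -float(np.mean(np.log2(eps)))

if __name__ == "__main__":
    eps = np.linspace(0.6, 1.0, 41)   # hypothetical samples of epsilon
    lhs = quantized_feedback_reward(100.0, 1.0, eps, 0.5)
    rhs = perfect_feedback_reward(100.0, 1.0, 0.5 + price_increment(eps))
    print(lhs, rhs, lhs >= rhs)
```

The quantized-feedback reward sits between the perfect-feedback reward at the original price and the perfect-feedback reward at the incremented price.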
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